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Abstract 



Approximate Bayesian Computation is a family of likelihood-free inference techniques that 
are well-suited to models defined in terms of a stochastic generating mechanism. In a nut- 
shell, Approximate Bayesian Computation proceeds by computing summary statistics Sobs 
from the data and simulating summary statistics for different values of the parameter 0. 
The posterior distribution is then approximated by an estimator of the conditional den- 
sity g{Q\Sobs)- In this paper, we derive the asymptotic bias and variance of the standard 
estimators of the posterior distribution which are based on rejection sampling and linear 
adjustment. Additionally, we introduce an original estimator of the posterior distribution 
based on quadratic adjustment and we show that its bias contains a fewer number of terms 
than the estimator with linear adjustment. Although we find that the estimators with ad- 
justment are not universally superior to the estimator based on rejection sampling, we find 
that they can achieve better performance when there is a nearly homoscedastic relationship 
between the summary statistics and the parameter of interest. To make this relationship 
as homoscedastic as possible, we propose to use transformations of the summary statistics. 
In different examples borrowed from the population genetics and epidemiological literature, 
we show the potential of the methods with adjustment and of the transformations of the 
summary statistics. Supplemental materials containing the details of the proofs are available 
online. 

Keywords: Conditional density estimation, implicit statistical model, simulation-based 
inference, kernel regression, local polynomial 
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1. INTRODUCTION 

Inference in Bayesian statistics relies on the full posterior distribution defined as 

where O G M''' denotes the vector of parameters and D denotes the observed data. The ex- 
pression given in ([1]) depends on the prior distribution vr(0), the likelihood function p{D\Q) 
and the marginal probability of the data p{D) = jQp{D\Q)'n-{Q) dQ. However, when the sta- 
tistical model is defined in term of a stochastic generating mechanism, the likelihood can be 
computationally intractable. Such difficulties typically arise when the generating mechanism 
involves a high- dimensional variable which is not observed. The likelihood is accordingly ex- 
pressed as a high-dimensional integral over this missing variable and can be computationally 
intractable. Methods of inference in the context of these so-called implicit statistical models 
have been proposed by Diggle and Gratton (1984) in a frequentist setting. Implicit statistical 
models can be thought of as a computer generating mechanism that mimics data generation. 
In the past ten years, interests in implicit statistical models have reappeared in population 
genetics where Beaumont et al. (2002) gave the name of Approximate Bayesian Computation 
(ABC) to a family of likelihood-free inference methods. 

Since its original developments in population genetics (Fu and Li 1997; Tavare et al. 
1997; Pritchard et al. 1999; Beaumont et al. 2002), ABC has successfully been applied in a 
large range of scientific fields such as archaeological science (Wilkinson and Tavare 2009), 
ecology (Frangois et al. 2008; Jabot and Chave 2009), epidemiology (Tanaka et al. 2006; 
Blum and Tran 2008), stereology (Bortot et al. 2007) or in the context of protein networks 
(Ratmann et al. 2007). Despite the increasing number of ABC applications, theoretical 
results concerning its properties are still lacking and the present paper contributes to filling 
this gap. 

In ABC, inference is no more based on the full posterior distribution g{Q\D) but on the 
partial posterior distribution g{Q\Sobs) where Sgbs denotes a vector of ci- dimensional summary 
statistics computed from the data D. The partial posterior distribution is defined as (Doksum 
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and Lo 1990) 



Of course, the partial and the full posterior distributions are the same if the summary 
statistics are sufficient with respect to the parameter 9. 

To generate a sample from the partial posterior distribution g{Q\Sobs), ABC proceeds 
by simulating n values ©j, i — 1 . . .n from the prior distribution tt, and then simulating 
summary statistics Sj according to p(s|0j). Once the couples (Bj, Sj), i = 1 . . . n, have been 
obtained, the estimation of the partial posterior distribution is a problem of conditional den- 
sity estimation. Here we will derive the asymptotic bias and variance of a Nadaraya- Watson 
type estimator (Nadaraya 1964; Watson 1964), of an estimator with hnear adjustment pro- 
posed by Beaumont et al. (2002), and of an original estimator with quadratic adjustment 
that we propose. 

Although replacing the full posterior by the partial one is an approximation crucial in 
ABC, wc will not investigate its consequences here. The reader is referred to Le Cam 
(1964) and Abril (1994) for theoretical works on the concept of approximate sufficiency; 
and to Joyce and Marjoram (2008) for a practical method that selects informative summary 
statistics in ABC. Here, we concentrate on the second type of approximation arising from 
the discrepancy between the estimated partial posterior distribution and the true partial 
posterior distribution. 

In this paper, we investigate the asymptotic bias and variance of the estimators of the 
posterior distribution g{0\sobs) {0 G M) of a one-dimensional coordinate of G. Section 2 
introduces parameter inference in ABC. Section 3 presents the main theorem concerning 
the asymptotic bias and variance of the estimators of the partial posterior. To decrease 
the bias of the different estimators, we propose, in Section 4, to use transformations of the 
summary statistics. In Section 5, we show some applications of ABC in population genetics 
and epidemiology . 
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2. PARAMETER INFERENCE IN ABC 

2.1 Smooth rejection 

In the context of ABC, the Nadaraya- Watson estimator of the partial posterior mean i?[G|Sobs] 
can be written as 

where i^^s(u) = |i?|^-'^i^(i?^-'^u), B is the bandwidth matrix that is assumed to be non- 
singular, if is a d-variate kernel such that J K{u.) du = 1, and \B\ denotes the determinant 
of B. Typical choices of kernel encompass spherically symmetric kernels K{u) = Ki{\\u\\), 
in which ||u|| denotes the Euclidean norm of u and Ki denotes a one-dimensional kernel. 
To estimate the partial posterior distribution g{6\Sobs) of a one- dimensional coordinate of G, 
we introduce a kernel K that is a symmetric density function on M. Here we will restrict 
our analysis to univariate density estimation but multivariate density estimation can also be 
implemented in the same vein. The bandwidth corresponding to K is denoted b' {b' > 0) 
and we use the notation Kbi{-) = K{-/b')/b'. As the bandwidth b' goes to 0, a simple Taylor 
expansion shows that 

Ee'[Kb'{e'-e)\sobs]^9{e\sobs). 

The estimation of the partial posterior distribution g{6\Sobs) can thus be viewed as a problem 
of nonparametric regression. After substituting 0j by Ky{9i — 9) in equation (|3]), we obtain 
the following estimator of g{9\Sobs) (Rosenblatt 1969) 

2^i=l^B{Si - Sobs) 

The initial rejection-based ABC estimator consisted of using a kernel K that took or 1 
values (Pritchard et al. 1999). This method consisted simply of rejecting the parameter 
values for which the simulated summary statistics were too different from the observed ones. 
Estimation with smooth kernels K was proposed by Beaumont et al. (2002). 
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2.2 Regression adjustment 

Besides introducing smoothing in the ABC algorithm, Beaumont et al. (2002) proposed 
additionally to adjust the 9i's to weaken the effect of the discrepancy between Sj and Sobs- 
In the neighborhood of Sobs, they proposed to approximate the conditional expectation of 9 
given s by mi where 

mi(s) = d + (s — SobsYp for s such that Kb{s — Sobs) > 0. (5) 

The estimates a (a G M) and /3 (/3 G M'') are found by minimizing the weighted sum of 
squared residuals 



WSSR = - (" + - SobsYmKBiSi - Sobs] 



i=l 



The least-squares estimate is given by (Ruppert and Wand 1994) 



(6) 



(d,/3) = {x'wxy^x'we, 



(7) 



where is a diagonal matrix whose i element is Ksisi — Sobs), and 



X 
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^obs 
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and s\ denotes the component of Sj. The principle of regression adjustment consists of 
forming the empirical residuals ei = 9i — ?fii(sj), and to adjust the 9i by computing 

9* = rhi{Sobs) + ei, i = l,...,n. (8) 

Estimation of g{9\sobs) is obtained with the estimator of equation ^ after replacing the 
6'i's by the 6'*'s. This leads to the estimator proposed by Beaumont et al. (2002, eq. (9)) 



^i(^|s 



obs) 



ZtiKb'{9*-9)KB{s,-Sobs) 

YJi=lKB{Si - Sobs) 



(9) 
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To improve the estimation of the conditional mean, we suggest a shght modification to 
gii6\sobs) using a quadratic rather than a hnear adjustment. Adjustment with general non- 
linear regression models was already proposed by Blum and Frangois (2010) in ABC. The 
conditional expectation of 6 given s is now approximated by rh2 where 



m2(s) = a + (s - SobsYfi + -(s - Sofc,)*7(s - Sobs) for s such that Keis - Sobs) > 0. (10) 

The three estimates (a, /3, 7) G M x M'^ x W^'^ are found by minimizing the quadratic extension 
of the least square criterion given in Because 7 is a symmetric matrix, the inference of 7 
only requires the lower triangular part and the diagonal of the matrix to be estimated. The 
solution to this new minimization problem is given by (JTj) where the design matrix X is now 
equal to 



^ ^1 ^obs ^1 ^obs 2 V^l '^obsJy^l '^obsJ ''' 2 



X 



1 ^1 ^2 , , , , , „ „ . f„d .d \2 



y ■'n ^obs ^obs 2 ^ n ^obsJK^n ^obs) 2 j 

Letting d** = m2(sobs) + {(^i — ^2(^1)), the new estimator of the partial posterior distri- 
bution is given by 

n(m. \ EtiKb'{er-e)KB{s.-s,bs) .... 

2^i=l^B{Si - Sobs) 

Estimators with regression adjustment in the same vein as those proposed in equations ([9]) 
and f fTTj) have already been proposed by Hyndman et al. (1996) and Hansen (2004) for 
performing conditional density estimation when d = 1. 

3. ASYMPTOTIC BIAS AND VARIANCE IN ABC 
3.1 Main theorem 

To study the asymptotic bias and variance of the three estimators of the partial posterior 
distribution gj{-\Sobs), j = 0, 1, 2, we assume that the bandwidth matrix is diagonal B = hD. 
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A more general result for non-singular matrix B is given in the Appendix. In practice, the 
bandwidth matrix B may depend on the simulations, but we will assume in this Section 
that it has been fixed independently of the simulations. This assumption facilitates the 
computations and is classical when investigating the asymptotic bias and variance of non- 
parametric estimators (Ruppcrt and Wand 1994). 

The first (resp. second) derivative of a function / with respect the variable x is denoted 
fx (resp. fxx)- When the derivative is taken with respect to a vector x, /x denotes the 
gradient of / and /xx denotes the Hessian of /. The variance-covariance matrix of K is 
assumed to be diagonal and equal to iJ,2{K)Id. We also introduce the following notations 



l_i2(K) = j^v?k{u)du, R{K) = J^K'^{n)dn, and R{k) = J^k^{u)du. Finally, if is a 



sequence of random variables and a„ is a deterministic sequence, the notation X„ = op(a„) 
mean that X„/a„ converges to zero in probability and X„ = Op(a„) means that the ratio 
Xn/ttn stays bounded in the limit in probability. 

Theorem 1 Assume that B = hD, in which b > is the bandwidth associated to the kernel 
K, and assume that conditions (Al):(A5) of the Appendix hold. The bias and the variance 
of the estimators gj{-\Sohs), J = 0, 1, 2, are given by 



E[gM^obs) - g{e\so,s)] = Ci6'' + C2,,-6' + Op{{b^ + b'^f) + Op(^), j = 0, 1, 2, (12) 



Ya.i[gj{e\Sohs)] 



(l + op(l)), J = 0,1, 2, 



(13) 



nb'^b' 



with 



li2{K)g0e{9\sobs) 



1 




(14) 




(15) 
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C2,2 = li2{K) 



( 





(16) 



and 



C3 



R{K)R{K)g{e\so,s) 

\D\p{Sobs) 



(17) 



Remark 1. Curse of dimensionality The mean square error (MSE) of an estimator 
is equal to the sum of its squared bias and its variance. With standard algebra, we find that 
the MSEs of the three estimators gj{-\sobs)i j = 0, 1, 2, are minimized when both b and b' are 
of the order of n^^/^'^+^\ This implies that the minimal MSEs are of the order of n~^^'^'^^^\ 
Thus, the rate at which the minimal MSEs converge to decreases importantly as the 
dimension d of Sobs increases. However, we wish to add words of caution since this standard 
asymptotic argument does not account for the fact that the 'constants' Ci, 6*2,6*3 involved 
in Theorem [1] also depend on the dimension of the summary statistics. Moreover, in the 
context of multivariate density estimation, Scott (1992) argued that conclusions arising from 
the same kind of theoretical arguments were in fact much more pessimistic than the empirical 
evidence. Finally, because the underlying structure of the summary statistics can typically 
be of dimension lower than d, dimension reduction techniques, such as neural networks or 
partial least squares regression, have been proposed (Blum and Frangois 2010; Wegmann 
et al. 2009). 

Remark 2. Effective local size and effect of design As shown by equations ( IT3|) and 
( ITTll . the variance of the estimators can be expressed, up to a constant, as 1. ^(^1^°''") ^ where 
the effective local size is n = n\D\p{sobs)b'^ ■ The effective local size is an approximation 
of the expected number of simulations that fall within the ellipsoid of radii equal to the 
diagonal elements of D times b. Thus equations ( |T3i) and ( |T71) reflect that the variance 
is penalized by sparser simulations around Sobs- Sequential Monte Carlo samplers (Sisson 
et al. 2007; Beaumont et al. 2009; Toni et al. 2009) precisely aim at adapting the sampling 
distribution of the parameters, a.k.a. the design, to increase the probability of targeting close 
to Sobs- Likelihood- free MCMC samplers have also been proposed to increase the probability 
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of targeting close to Sobs (Marjoram et al. 2003; Sisson and Fan 2010). 

Remark 3. A closer look at the bias There are two terms in the bias of go{-\sobs) 
(equation (1141) ) that are related to the smoothing in the space of the summary statistics. 
The first term in equation ( fT4l) corresponds to the effect of the design and is large when 
the gradient of Dp{-) is collinear to the gradient of Dg{6\-). This term reflects that, in the 
neighborhood of s^bs, there will be an excess of points in the direction of Dps{Sobs)- Up to 
a constant, the second term in equation (ITll) is proportional to tT{D'^gss{0\s)\s=s^^J which is 
simply the sum of the elementwise product of D and the Hessian 5'ss(6'|s)|s=Soft^. This second 
term shows that the bias is increased when there is more curvature of g{-\s) at Sobs and more 
smoothing. 

For the estimator g2{-\Sobs) with quadratic adjustment, the asymptotic bias is the same 
as the bias of an estimator for which the conditional mean would be known exactly. Results 
of the same nature were found, for d = 1, hj Fan and Yao (1998) when estimating the con- 
ditional variance and by Hansen (2004) when estimating the conditional density. Compared 
to the bias of g2{-\sobs), the bias of the estimator with linear adjustment gi{-\sobs) contains 
an additional term depending on the curvature of the conditional mean. 

3.2 Bias comparison between the estimators with and without adjustment 
To investigate the differences between the three estimators, we first assume that the partial 
posterior distribution of 6 can be written as h{d — m(s)) in which the function h does not 
depend on s. This amounts to assuming an homoscedastic model in which the conditional 
distribution of 6 given s depends on s only through the conditional mean m(s). If the 
conditional mean m is linear in s, both 6*2,1 and 6*2,2 are null involving that both estimators 
with regression adjustment have a smaller bias than go{-\sobs)- For such ideal models, the 
bandwidth b of the estimators with regression adjustment can be taken infinitely large so 
that the variance will be inversely proportional to the total number of simulations n. Still 
assuming that g{0\s) = h{6 — m{s)), but with a non-linear m, the constant 6*2,2 is null so 
that the estimator g2{-\Sobs) has the smallest asymptotic MSE. However, for general partial 
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posterior distributions, it is not possible to rank the three different biases. Consequently, 
when using the estimators with adjustment, the parameterization of the model should be 
guided toward making the distributions g{9\s) as homoscedastic as possible. To achieve this 
objective, we propose, in the next section, to use transformations of the summary statistics. 

4. CHOOSING A REGRESSION MODEL 

4.1 Transformations of the summary statistics and the parameters 

To make the regression as homoscedastic as possible, we propose to transform the summary 
statistics in equations ([5]) and (fTOj) . Here we consider logarithmic and square root transfor- 
mations only. We choose the transformations that minimize the weighted sum of squared 
residuals (WSSR) given in equation in which we take an uniform kernel for the weight 
function K. The weights Ksi^i — Sots) depend on the transformations of the summary 
statistics and the uniform kernel ensures that the WSSR are comparable for different trans- 
formations. Since there are a total of 3"^ regression models to consider, greedy algorithm can 
be considered for large values of 3*^. 

Although transformations of the parameter 6 in the regression equations ([5]) and (ITOl) 
can also stabilize the variance (Box and Cox 1964), we rather use transformations of 6 
for guaranteeing that the adjusted parameters 9* and 9** lie in the support of the prior 
distribution (Beaumont et al. 2002). For positive parameters, we use a log transformation 
before regression adjustment. After adjusting the logarithm of a positive parameter, we 
return to the original scale using an exponential transformation. Replacing the logarithm 
by a logit transformation, we consider the same procedure for the parameters for which the 
support of the prior is a finite interval. 

4.2 Choosing an estimator of g{-\sobs) 

In Section 3, we find that there is not a ranking of the three estimators gj{-\Sobs), j = 0, 1, 2, 
which is universally valid. Since the three estimators rely on local regressions, of degree 0, 1, 
and 2, we propose to choose the regression model that minimizes the prediction error of the 
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regression. Because the regression models involve a different number of predictors, we use 
cross-validation to evaluate the prediction error. We introduce the following leave-one-out 
estimate 

n 
i=l 

where mj*(si) denotes the estimate of m{9i\si) obtained, in the neighborhood of Sj, with a 
local polynomial of degree j by removing the i*'* point of the training set. 

5. EXAMPLES 

5.1 Example 1: A Gaussian model 

We are interested here in the estimation of the variance parameter in a Gaussian sample. 
Although Approximate Bayesian Computation is not required for such a simple model, this 
example will highlight the potential importance of the transformations of the summary 
statistics and of the methods with adjustment. Assume that we observe a sample of size 
iV = 50 in which each individual is a Gaussian random variable of mean fi and variance cr^. 
We assume the following hierarchical prior for /i and cr^ (Gelman et al. 2003) 

~ Invx^(d.f. = 1) (18) 
At ~ Af{0,(T^), (19) 

where Invx^(d.f. = u) denotes the inverse chi-square distribution with u degrees of freedom, 
and J\f denotes the Gaussian distribution. We consider the empirical mean xn and variance 
s% as the summary statistics. These two statistics are sufficient with respect to the parameter 
(T^ (Gelman et al. 2003). The data come from the well-known Iris data set and consist of the 
sample of the petal lengths for the virginica species {x^ — 5.552, s% — 0.304, and N — 50). 

We perform a total of 100 ABC replicates. Each replicate consists of simulating n = 
20, 000 Gaussian samples. We consider a spherically symmetric kernel for K and an Epanech- 
nikov kernel for Ki. We assume a diagonal bandwidth matrix B — bD where D contains 
the standard deviation of each summary statistic in the diagonal and b is the 2.5% quantile 
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of the distances ||sj — Sobs\\, i = 1 ■ ■ ■ n. This procedure amounts to choosing the 500 simu- 
lation that provide the best match to the observed summary statistics. In the two following 
examples, we consider the same number of simulations, the same bandwidth matrix, and 
the same kernel. Here the true posterior distribution is known exactly (Gelman et al. 2003) 
and can be compared to the different estimates obtained with ABC. Since is a positive 
parameter, its log is regressed as described in Section 4. As displayed by Figure [H the 
estimate with linear adjustment gi{cr^\xj\f, s%) provides a good estimate provided that the 
empirical variance is log-transformed in the regression setting. The WSSR criterion selects 
the right transformation here since it is minimum for the logarithmic transformation in all 
of the 100 test replicates. When considering xn and logs^ in the regression, both the linear 
and the quadratic adjustment provide good estimate of by contrast to the method without 
adjustment (see Figured]). The cross-validation criterion never selects the method without 
adjustment, selects 74 times linear adjustment and 26 times quadratic adjustment. 

5.2 Example 2: Coalescent model in population genetics 

ABC was originally developed for inferring parameters of coalescent models in populations 
genetics (Pritchard et al. 1999). Coalescent models describe, in a probabilistic fashion, the 
tree-like ancestry of genes represented in a sample. Because the ancestral tree is unknown, 
the likelihood involves an integral over this high dimensional ancestral tree and is computa- 
tionally intractable. Here we aim at estimating the Time since the Most Recent Common 
Ancestor (TMRCA) of a sample of gene. This time is equal to the age of the root of the 
ancestral tree. A description of the coalescent process and the TMRCA is given in Figure |2l 
The coalescent prior for the TMRCA and the whole ancestral tree can be described by the 
following hierarchical procedure 

1. Simulate the size of the entire population according to its prior distribution, a 
uniform distribution between and 10,000 here. 

2. Simulate the T^'s, the k^^ inter-coalescence times, as exponential random variables of 
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Figure 1: Estimation of the posterior quantiles of the variance parameter cr^ in a Gaussian 
sample. We perform a total of 100 ABC replicates and we display the boxplots of the 
estimated posterior quantiles. A)Estimation of the posterior quantiles with linear adjustment 
using (xAr,s^), (xat, a/s^), and (xtv, logs^). B) Estimation of the posterior quantiles with 
no adjustment and with linear and quadratic adjustment considering (a;7v,logs^) as the 
summary statistics. The horizontal lines correspond to the true posterior quantiles. In 
this Gaussian example, both log transformation of the empirical variance and regression 
adjustment are crucial for accurate estimation of the posterior distribution. Id. stands for 
the identity function, adj. for adjustment and quadr. for quadratic. 

rate k{k — 1)/(2A^), k = 2 . . . m, where m is the number of sequences in the sample. 
Time is counted in generations here. 

The TMRCA is given by the sum of the inter-coalescence times T2 + ■ ■ ■ + T^. Once the 
genealogical tree has been generated, DNA sequences are simulated by superimposing mu- 
tations along the tree according to a Poisson process of rate u where u is the mutation 
rate. Here we assume that the mutation rate is known and we use u = 1.8 x 10^'^ mu- 
tation/generation for the whole 500 base pairs DNA sequence (Cox 2008). Assuming the 
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infinitely-many- sites model, each mutation hit a so-called segregating site that has never 
been hit before. As summary statistics, we consider the total number of segregating sites S 
and the mean number of mutations between the ancestor and each individual in the sample. 
The latter summary statistic is called the p statistic and is central in the field of molecular 
dating (Cox 2008). 
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Figure 2: Coalescent process for simulating DNA sequences. This example is excerpted from 
Cox (2008). There are a total of ten DNA sequences. We display only the upper part of 
the tree in which mutations occur. We omit the lower part corresponding to the coalescence 
times T^, . . . ,TiQ. The ancestral sequence is a sequence of 500 base pairs and contains a 
repetition of 0. The stars denote the — )■ 1 mutations. To generate this tree, a mutation 
rate of 3.6 x 10^^/base pairs/generation was considered. The true TMRCA is equal to 465 
generations here. To infer the TMRCA, we consider the number of segregating sites S = 6 
and the mean number of mutations between the ancestor and the individuals p = 2.10 as 
summary statistics. 

We infer the TMRCA using the DNA sequences simulated by Cox (2008). The true TM- 
RCA was equal to 465 generations in his simulation and the values of the summary statistics 
are S = 6 and p = 2.10. Since the TMRCA is a positive parameter, we use a logarith- 
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mic transformation when performing the regression adjustment. As shown in Table [H the 
WSSR criterion selects the regression equation log TMRCA = logp + S*. The cross validation 
criterion points to the estimator with quadratic adjustment although the prediction errors 
obtained with the linear and quadratic regressions are almost the same (see Table |2]). In this 
example, we do not observe the dramatic effect of the transformations and the adjustments 
that we found for the Gaussian example. As displayed in Figure [31 both transformations of 
the summary statistics and regression adjustments do not greatly alter the estimated poste- 
rior distribution. Figure |3] also shows that the posterior distribution is clearly more peaky 
than the prior indicating that the summary statistics convey substantial information about 
the TMRCA. The 95% credibility interval of the posterior (400 — 2450) is indeed consider- 
ably narrower than the credibility interval of the prior (300 — 30, 800). However, as is typical 
with molecular dating, there remains considerable uncertainty when estimating the TMRCA 
(Cox 2008). The 95% credibility interval of the TMRCA ranges from a value slightly inferior 
to the true one to a value more than five times larger than the true one. 



Table 1: Choosing a transformation of the summary statistics 
Parameter Sum of squared residuals 
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Table 2: Cross validation criterion for choosing an estimator of the posterior distribution 



Parameter 


No adjustment 


Linear adjustment 


Quadratic adjustment 


TMRCA (Example 2) 
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0.624 
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Figure 3: Posterior distribution of the TMRCA. A) Estimated posterior distributions with 
hnear adjustment considering three different transformations of the summary statistics. The 
summary statistics log p and S provide the smallest residual error. B) Estimated posterior 
distributions using the three different estimates (TMRCA | (log p, S*)), j = 0,1,2. The 
quadratic regression provides the smallest prediction error as found with a leave-one-out 
estimate. For this coalescent example, both transformations of the summary statistics and 
regression adjustments do not greatly alter the estimated posterior distribution 

5.3 Example 3: Birth and death process in epidemiology 

To study the rate at which tuberculosis spread in a human population, Tanaka et al. (2006) 
make use of available genetic data of Mycobacterium tuberculosis isolated from different 
patients. DNA fingerprint at the IS6110 marker were obtained for 473 isolates sampled 
in San Francisco during 1991 and 1992 (Small et al. 1994). The IS6110 fingerprints were 
grouped into 326 distinct genotj^es whose configuration into clusters is represented by 

30^23^15^10^8^524^3^^22°l2*^ 

where indicates there are k clusters of size n. To infer the rate of transmission of the 
disease from this data, Tanaka et al. (2006) introduced a stochastic model of transmission 
and mutation. We denote by Xj(t) the number of cases of type i at time t, by G{t) the 
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current number of distinct genotypes, and by N{t) the total number of cases. The model 
starts with Xi{0) = 1, A^(0) = 1 and G{0) = 1. We denote by a, 6, and 9, the per-capita 
birth rate, death rate and mutation rate. When a birth occurs for an individual of genotype 
i, the value of is incremented by 1. If the event is a death, the value of Xi{t) is 

decremented by 1. When a mutation occurs for an individual of genotype i, we assume the 
infinitely-many- alleles model in which a new allele is formed. This means that the value of 
Xi{t) is decremented by 1 and a case of a new genotype is created. Following Tanaka et al. 
(2006), the process is stopped when = 10,000. At the stopping time, a sample of size 
n = 473 is drawn from the final population randomly without replacement. As summary 
statistics, we consider the total number of genotypes G in the sample and the homozygosity 
H of the sample defined as H = ^(^^^/n)^, where n^, i = 1 . . . G, denotes the number of 
individual of genotype i in the sample. We consider the following prior specification 



The informative prior for 6 (in mutation/year) arises from previous estimations of the mu- 
tation rate (Tanaka et al. 2006). 

We are interested in the estimation of the net transmission rate a — 6, of the doubling 
time of the disease log2/(a; — 6), and of the basic reproduction number Rq = a/6. Since 
they are positive parameters, they are log-transformed in the regression equations. Once log- 
transformed, the transmission rate and the doubling time are equal up to a multiplicative 
constant so that the optimal transformation and adjustment are the same for both parame- 
ters. We find that transforming G and H with the log function is optimal for inferring the 
doubling time whereas log-transforming H only is optimal for inferring Rq (see Table [T]). 
For all parameters, we select linear adjustment based on the cross-validation criterion (see 
Table [2]) . As displayed in Figure HI transformations of the summary statistics and regression 
adjustments do not greatly alter the estimated posterior distributions except when estimat- 



^ ~ A/'(0.20, 0.07^) 



(20) 




(21) 
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ing Rq. For the transmission rate and the doubhng time, the posterior distributions greatly 
differ from the prior distributions (see Figure [Hand Table [3]). However, for the reproduction 
number Rq, the posterior 95% credibihty interval is hardly narrower than the prior credi- 
bility interval. These comparisons between the prior and the posterior distributions suggest 
that the genotype data convey much more information for estimating the transmission rate 
and the doubling time than for estimating the reproduction number Rq. A large credibility 
interval for the parameter Rq was also find by Tanaka et al. (2006). 



Table 3: Posterior estimates of epidemiological quantities for the San Francisco data 



Parameter 


Description 


95% Prior C.l.'^ 


Posterior mode 


95% Posterior C.I.'^ 


a — 6 


Transmission rate (years) 


0.01-9.97 


0.56 


0.16-0.95 


log2/(a-5) 


Doubling time (years) 


0.06-57.85 


1.16 


0.73-4.35 


a/6 


Reproduction number i?o 


1.27-123.32 


4.00 


2.24-117.45 



C.I. stands for credibility intervals 



6. CONCLUSION 

In this paper, we presented Approximate Bayesian Computation as a technique of inference 
that relies on stochastic simulations and non-parametric statistics. We introduced an es- 
timator of g{6\Sobs) based on quadratic adjustment for which the asymptotic bias involves 
fewer terms than the asymptotic bias of the estimator with linear adjustment proposed by 
Beaumont et al. (2002). More generally, we showed that the benefit arising from regres- 
sion adjustment (equations and ( ITTi) ) is all the more important that the distribution of 
the residual e is independent of s in the regression model d{s) = m{s) + e. To make this 
regression model as homoscedastic as possible, we proposed to use transformations of the 
summary statistics when performing regression adjustment. We proposed to select the trans- 
formation of the summary statistics that minimizes the sum of squared residuals within the 
window of the accepted simulations. In a Gaussian example, we showed that transformations 
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Figure 4: Posterior distributions of key epidemiological quantities for the tuberculosis epi- 
demic in San Francisco. The abbreviation transf. stands for transformation. In this example, 
both transformations of the summary statistics and regression adjustments do not greatly 
alter the estimated posterior distributions except when estimating Rq. For the transmission 
rate and the doubling time, the posterior distributions greatly differ from the prior distri- 
butions. For the reproduction number Rq, there is not an important difference between the 
prior and the posterior indicating than the data do not convey enough information for a 
confident estimation of Ro- 
of the summary statistics and regression adjustment can dramatically improve inference in 
ABC. In two other examples borrowed from the population genetics and epidemiology liter- 
ature, regression adjustment and transformations of the summary statistics had little effect 
on the estimated posterior distribution. However, above all, these two examples emphasize 
the potential of ABC for complex models for which the likelihood is not computationally 
tractable. 

APPENDIX 

APPENDIX A. HYPOTHESES OF THEOREM 1 
Al) The kernel K has a finite second order moment such that J uu^K{u.)du = ^2{K)Id 
where ^i2{K) ^ 0. We also require that all first-order moments of K vanish, that 
is, J UjK(u) du = for i = 1, . . . , As noted by Ruppert and Wand (1994), this 

20 



condition is fulfilled by spherically symmetric kernels and product kernels based on 
symmetric univariate kernels. 



A3) The observed summary statistics Sobs he in the interior of the support of p. At Sgbs, all 
the second order derivatives of the function p exist and are continuous. 

A4) The point 9 is in the support of the partial posterior distribution. At the point {6, Sobs), 
all the second order derivatives of the partial posterior g exist and are continuous. The 
conditional mean of 9, m(s), exists in a neighborhood of Sobs and is finite. All its second 
order derivatives exist and are continuous. 

A5) The sequence of non-singular bandwidth matrices B and bandwidths b' is such that 

l/{n\B\b'), each entry of B'^B, and b' tend to as n— > oo. 



The three estimators of the partial posterior distribution gj{-\Sobs), j — 0, 1, 2, are all of the 
Nadaraya- Watson type. The difficulty in the computation of the bias and the variance of the 
Nadaraya- Watson estimator comes form the fact that it is a ratio of two random variables. 
Following Pagan and UUah (1999, p. 98) or Scott (1992), we linearize the estimators in order 
to compute their biases and their variances. We write the estimators of the partial posterior 
distribution gj, j = 0,1,2, as 



A2) The kernel K is a. symmetric univariate kernel with finite second order moment /i2(A'). 



APPENDIX B. PROOF OF THEOREM 1 



gj{0\sobs) 



9j,N 



J = 0,1,2, 



where 




9i,N 



1 
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1 " ~ 

n ' ' 

and 



5'2,N 

n . 



n 



go = ^KsiSi - Sobs)- 
i=l 

To compute the asymptotic expansions of the moments of the three estimators, we use the 
following lemma 

Lemma 1 For j = 0, 1,2, we have 

+Op(Cov(^j- N, ^d) + Var[^D]) (A.l) 

Proof. Lemma [1] is a simple consequence of the Taylor expansion for the function 
(x, y)— > x/y in the neighborhood of the point {E[gj^T<{], -E[^d]) (see Pagan and Ullah (1999) 
for another proof). The order of the reminder follows from the weak law of large numbers. 

□ 

The following Lemma gives an asymptotic expansion of all the expressions involved in 
equation (lA.ip . 



Lemma 2 Suppose assumption (Al)-(A5) hold, denote e = 6 — m{Sobs), then we have 

E[gB] = Pi^obs) + ^fi2iKMBB'pUsobs)) + o(tr(5*5)), (A.2) 

^[^o,n] = Pisobs)9i0\sobs) + lb''^fJ'2iK)geeiO\Sobs)piSobs) 
+fi2{K)[gM^)l=s^^BB%{Sobs) + lg{9\sobsMBB%,{sobs)) 

+ '^p{SobsMBB'gUO\s)\s=s^J] + o{h'^) + o(tr(i?*fi)), (A.3) 

E[gi,^] = p{Sobs)h{e\Sobs) + jb'"^ fi2{K)hee{e\Sobs)p{Sobs) 

+fi2iK)[h^{e\s)l^^^^^BB'p^{sobs) + lh{e\Sobs)tT{BB%^{Sobs)) 
+ lpiSobsMBB'hMs)\,=s^J - ^^^^^tr(55*m,,(s,b,))] 

+o(6'2) + o(tr(5*5)), (A.4) 

22 



-E[^2,n] = p{sobs)h{e\sobs) + Ib'^ fi'2{K)h^^{e\Sobs)pisobs) 
+fi2{K)[hs{e\s)\^^^^^^BB^Ps{sobs) + ^h{e\Sobs)tT{BB^p^^{Sobs)) 

+ ip(s„,,)tr(55*/i,,(e|s)|,=,^^J + o(6'2) + o(tr(5*5)), (A.5) 

Var^j- N = TTi^i +0(-)+0(— -— )+0(— -), j = 0, 1, 2, (A.7) 

no'\B\ n nb'\B\ n\B\ 

<^ow ^j-N,^D = ^ + J = 0,1, 2. (A.8) 

Proof. See the Supplemental Material available online q 
Theorem [1] is a particular case of the following theorem that gives the bias and variance of 
the three estimators of the partial posterior distribution for a general nonsingular bandwidth 
matrix B. 

Theorem 2 Assume that B is a non-singular bandwidth matrix and assume that conditions 
(A1)-(A5) holds, then the bias of c/j, j = 0, 1,2, is given by 



E[g^{e\s,bs)-g{0\Sobs)]=D,b'' + D2,,+Op{{tT{B'B) + b'^f)+Op{^), j = 0,1,2, (A.9) 

n\JD\ 

with 



^ ,,,j 9s{e\s)l^^^^BB%{s^bs) , tTiBB%M^)ls=s^J 

,T^M'^'^^\^y\-=-o,s^^'P'^^"bs) tr(55*/i,,(e|s)|s=s„,J K{e\SobsMBB'm, 
= ''^""^ ' ^) + 2 2 



and 



- ^'^""^ ' ^) ~2 

The variance of the estimators gj , j = 0,1,2, is given by 
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V^ar[,,(^|s.,.)] = (1 + opil)). (A.IO) 

Proof. 

Theorem [2] is a consequence of Lemma [1] and [21 Taking expectations on both sides of 
equation (lA.ip . we find that 



E[9jieKts]) = ^0 + Op[Covig,,^,gn)+Ve.rign)]. (A.ll) 

Using a Taylor expansion, and the equations (lA.2p - (lA.5p . (lA.6p . and (lA.Sp given in 
Lemma [21 we find the bias of the estimators given in equation flA.Qp . 

For the computation of the variance, we find from equation (lA.ip and (lA.lip that 



gAOM - E[g,{eK,,)] - -^^^ + Op{^^- (A.12) 



The order of the reminder follows from equations (1A.6P and ( lA.Sp . Taking the expectation 
of the square of equation (]A.12p . we now find 



r-^ (Q\ n Var[^j-N] , ^[^j,N]^Var[^D] on ^-^ - \E[gjA , i ^ \ i \ 
y-A9A0M) = ^j-p + ^j-p 2Cov(,„,,,)-^ + op(^). (A.13) 



The variance of the estimators given in equation ( lA.lOp follows from a Taylor expansion that 



makes use of equations (IA.2p -( lAl8l) given in Lemma [21 

□ 

APPENDIX C. SUPPLEMENTAL MATERIALS 

Proof of Lemma [2l 
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